OCPBUGS-32467: nodepool_controller: add a reconciler for cleanup #3969

Conversation

stevekuznetsov
Contributor

When we change the config for a NodePool or the way in which we hash the values, we leak token and user data secrets. However, since these secrets are annotated with the NodePool they were created for, it's simple to check that the Secret still matches what we expect and clean it up otherwise.
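The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the annotation key, the `secret` type, and the `token-`/`user-data-` name formats are assumptions made for the example.

```go
package main

import "fmt"

// nodePoolAnnotation is a hypothetical annotation key for this sketch.
const nodePoolAnnotation = "example.io/nodepool"

// secret is a minimal stand-in for the metadata of a Kubernetes Secret.
type secret struct {
	Name        string
	Annotations map[string]string
}

// expectedSecretNames returns the names that are valid for a NodePool at its
// current config hash. The "token-" and "user-data-" formats are assumptions.
func expectedSecretNames(nodePool, hash string) map[string]bool {
	return map[string]bool{
		fmt.Sprintf("token-%s-%s", nodePool, hash):     true,
		fmt.Sprintf("user-data-%s-%s", nodePool, hash): true,
	}
}

// isStale reports whether s is annotated as belonging to nodePool but no
// longer matches any name expected for the current hash.
func isStale(s secret, nodePool, currentHash string) bool {
	if s.Annotations[nodePoolAnnotation] != nodePool {
		return false // not owned by this NodePool; leave it alone
	}
	return !expectedSecretNames(nodePool, currentHash)[s.Name]
}

func main() {
	owned := map[string]string{nodePoolAnnotation: "workers"}
	secrets := []secret{
		{Name: "token-workers-aaaa1111", Annotations: owned},
		{Name: "token-workers-bbbb2222", Annotations: owned},
		{Name: "user-data-workers-bbbb2222", Annotations: owned},
	}
	for _, s := range secrets {
		fmt.Printf("%s stale=%t\n", s.Name, isStale(s, "workers", "bbbb2222"))
	}
}
```

Secrets without the annotation are never considered stale here, so the pruner only touches objects it can attribute to the NodePool being reconciled.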

/assign @sjenning

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 1, 2024
@openshift-ci-robot

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-32467, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 1, 2024
@openshift-ci openshift-ci bot requested review from hasueki and isco-rodriguez May 1, 2024 00:57
@openshift-ci openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 1, 2024
@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch 2 times, most recently from afe605c to 000f5f8 Compare May 1, 2024 13:32
	return ctrl.Result{}, err
}

targetPayloadConfigHash := supportutil.HashSimple(config + targetVersion + pullSecretName + globalConfig)
Member

@enxebre enxebre May 2, 2024

Ideally this would be enforced by a function signature or similar, since we use this formula on creation as well; otherwise there's a risk of divergence.

Contributor Author

Yeah, was hoping there would be a suggestion on this. They're all strings so even a signature ends up being stringly typed - WDYT?

Member

@enxebre enxebre May 3, 2024

Yeah, not ideal, but I guess having a signature is still slightly better. Say you are modifying the code in the main loop and you add a new parameter: the binary won't compile unless you pass the same number of parameters here. But then, yes, you could still satisfy it by passing the wrong string here.
Maybe start with a signature and then refactor a bit towards having a single func targetHash(nodePool) (string)?

Contributor Author

func targetHash(nodePool) string is not really possible since getting all the data for it requires a lot of other actions - and, even if we had it hang off of the reconciler as a method, the main reconciliation method ends up using a lot of the intermediate calculated steps in other parts of the method, so it would be something like func (r *NodePoolReconciler) targetHash(nodePool) (expectedCoreConfigResources, releaseImage, ... hash). I can do that but it seemed awkward.
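For illustration, one possible shape for a shared hash helper is sketched below. It is not necessarily what the PR landed on. The struct and function names are hypothetical, and `hashSimple` is only a stand-in for `supportutil.HashSimple` (the FNV-1a choice is an assumption). Named fields keep the inputs from being swapped silently, which a positional all-string signature cannot prevent.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// payloadHashInputs names every input to the hash, so the creation path and
// the cleanup path must fill in the same fields. Hypothetical type.
type payloadHashInputs struct {
	Config         string
	TargetVersion  string
	PullSecretName string
	GlobalConfig   string
}

// hashSimple is a stand-in for supportutil.HashSimple; using 32-bit FNV-1a
// here is an assumption made for this sketch.
func hashSimple(s string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(s)) // hash.Hash writes never return an error
	return fmt.Sprintf("%08x", h.Sum32())
}

// payloadConfigHash is the single place where the concatenation order lives.
func payloadConfigHash(in payloadHashInputs) string {
	return hashSimple(in.Config + in.TargetVersion + in.PullSecretName + in.GlobalConfig)
}

func main() {
	in := payloadHashInputs{
		Config:         "cfg",
		TargetVersion:  "4.16.0",
		PullSecretName: "pull-secret",
		GlobalConfig:   "global",
	}
	fmt.Println("create path: ", payloadConfigHash(in))
	fmt.Println("cleanup path:", payloadConfigHash(in))
}
```

Adding a field to the struct does not break callers the way a new positional parameter would, so this trades the compile-time nudge discussed above for protection against argument mix-ups.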

@enxebre
Member

enxebre commented May 2, 2024

It's not clear to me what would cause this issue:
1 - The jira says

"During HyperShift operator updates/rollout, previous ignition-server token and user-data secrets are not properly cleaned up and causing them to be abandoned on the control plane."

How would the HO upgrade cause this? Can you articulate this?

2 - The PR desc says

When we change the config for a NodePool

This is a valid use case, and it should be already covered by dropping this todo now https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/nodepool/nodepool_controller.go#L780-L793 for the user data, and by the https://github.com/openshift/hypershift/blob/main/ignition-server/controllers/tokensecret_controller.go#L159-L170 for expired tokens.
Agree/disagree?

3 - The PR desc says

or the way in which we hash the values,

This I can see as the main valid use case for this pruner.

@stevekuznetsov
Contributor Author

@enxebre I think for 2) the code you linked is still susceptible to leaks since you'd have to ensure the cleanup happened before the NodePool changed again, and we don't have a transactional way to do that.

	return ctrl.Result{}, nil
}

log.WithValues("options", names, "valid", valid).Info("removing secret as it does not match the expected set of names")
Member

By deleting the token secret synchronously here, there's a risk we break in-flight instances that are booting and about to run ignition. That's in part why we set it for expiration here:

if err == nil {
	if err := setExpirationTimestampOnToken(ctx, r.Client, tokenSecret); err != nil && !apierrors.IsNotFound(err) {
		return ctrl.Result{}, fmt.Errorf("failed to set expiration on token Secret: %w", err)
	}
}

cc @relyt0925
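The "mark for expiration instead of deleting" idea can be sketched as below. This is a simplified illustration operating on a plain annotation map rather than a real Secret; the annotation key, the grace period, and the function names are assumptions and do not mirror `setExpirationTimestampOnToken`.

```go
package main

import (
	"fmt"
	"time"
)

// expirationAnnotation is a hypothetical key for this sketch.
const expirationAnnotation = "example.io/token-expiration-timestamp"

// Example values used by main below; the two-hour grace period is arbitrary.
const exampleGrace = 2 * time.Hour

var exampleNow = time.Date(2024, time.May, 3, 12, 0, 0, 0, time.UTC)

// markForExpiry records a deadline once and never extends it, so instances
// that are already booting get a bounded window to finish ignition.
func markForExpiry(annotations map[string]string, now time.Time, grace time.Duration) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, ok := annotations[expirationAnnotation]; !ok {
		annotations[expirationAnnotation] = now.Add(grace).Format(time.RFC3339)
	}
	return annotations
}

// isExpired reports whether the recorded deadline has passed. Secrets that
// were never marked are not expired.
func isExpired(annotations map[string]string, now time.Time) (bool, error) {
	raw, ok := annotations[expirationAnnotation]
	if !ok {
		return false, nil
	}
	deadline, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false, fmt.Errorf("parsing expiration %q: %w", raw, err)
	}
	return !now.Before(deadline), nil
}

func main() {
	a := markForExpiry(nil, exampleNow, exampleGrace)
	for _, at := range []time.Time{exampleNow, exampleNow.Add(3 * time.Hour)} {
		expired, err := isExpired(a, at)
		fmt.Println(at.Format(time.RFC3339), expired, err)
	}
}
```

A separate controller can then delete only the secrets for which `isExpired` returns true, which avoids the synchronous delete that the comment above warns about.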

Contributor Author

@enxebre in order to annotate it correctly, we need to know that the secret we're reconciling is a token secret. Do you want us to add something machine-readable to identify them, or is some string prefix check on the name sufficiently safe?
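The two options in this question can be compared with a small sketch. Both the `token-` prefix and the label key and value are assumptions made for the example, not identifiers confirmed by this thread.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	tokenSecretPrefix = "token-"                 // assumed naming convention
	secretKindLabel   = "example.io/secret-kind" // hypothetical label key
)

// isTokenSecretByName relies on the naming convention alone, so any unrelated
// Secret whose name starts with the prefix is misclassified.
func isTokenSecretByName(name string) bool {
	return strings.HasPrefix(name, tokenSecretPrefix)
}

// isTokenSecretByLabel relies on an explicit machine-readable marker, which
// only exists on Secrets created after the marker was introduced.
func isTokenSecretByLabel(labels map[string]string) bool {
	return labels[secretKindLabel] == "token"
}

func main() {
	fmt.Println(isTokenSecretByName("token-workers-bbbb2222"))
	fmt.Println(isTokenSecretByName("user-data-workers-bbbb2222"))
	fmt.Println(isTokenSecretByLabel(map[string]string{secretKindLabel: "token"}))
}
```

The prefix check works for secrets that already exist, while a label needs a migration or a fallback for older objects.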

@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from 000f5f8 to 86bb63e Compare May 3, 2024 15:46
@stevekuznetsov
Contributor Author

@enxebre took a stab at making it harder to accidentally change one hash and not both, and updated to mark tokens for expiry

@enxebre
Member

enxebre commented May 6, 2024

Thanks! Dropped a few more comments, lgtm otherwise. Let's run this by the IBM folks afterwards.

@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from 86bb63e to 986495f Compare May 6, 2024 16:39
@stevekuznetsov
Contributor Author

@enxebre updated for your comments - please let me know if you find the comment in #3969 (comment) convincing.

@enxebre
Member

enxebre commented May 7, 2024

/approve
cc @hasueki @relyt0925 PTAL

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2024
Contributor

openshift-ci bot commented May 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, stevekuznetsov

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2024
@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from 986495f to 2cdf40a Compare May 8, 2024 13:40
@openshift-ci-robot

/hold

Revision b071aaf was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2024
@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from b071aaf to 21bfd1a Compare May 28, 2024 17:20
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from 21bfd1a to 964c6ae Compare May 28, 2024 18:25
@stevekuznetsov
Contributor Author

/retest

@stevekuznetsov
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@stevekuznetsov
Contributor Author

/retest

@csrwng
Contributor

csrwng commented May 30, 2024

/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"

Contributor

openshift-ci bot commented May 30, 2024

@csrwng: Overrode contexts on behalf of csrwng: Red Hat Konflux / hypershift-operator-main-on-pull-request

@stevekuznetsov
Contributor Author

/retest

When we change the config for a NodePool or the way in which we hash the
values, we leak token and user data secrets. However, since these
secrets are annotated with the NodePool they were created for, it's
simple to check that the Secret still matches what we expect and clean
it up otherwise.

Signed-off-by: Steve Kuznetsov <[email protected]>
@stevekuznetsov stevekuznetsov force-pushed the skuznets/clean-up-ignition-secrets branch from 964c6ae to 41c0b63 Compare June 2, 2024 17:05
@csrwng
Contributor

csrwng commented Jun 4, 2024

/lgtm
/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD e66add8 and 2 for PR HEAD 41c0b63 in total

Contributor

openshift-ci bot commented Jun 4, 2024

@stevekuznetsov: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure | 41c0b63 | link | false | /test e2e-azure |
| ci/prow/e2e-kubevirt-azure-ovn | 41c0b63 | link | false | /test e2e-kubevirt-azure-ovn |

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD be9e3f8 and 1 for PR HEAD 41c0b63 in total

@openshift-merge-bot openshift-merge-bot bot merged commit ca90be6 into openshift:main Jun 5, 2024
11 of 14 checks passed
@openshift-ci-robot

@stevekuznetsov: Jira Issue OCPBUGS-32467: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32467 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@stevekuznetsov: #3969 failed to apply on top of branch "release-4.16":

Applying: nodepool_controller: add a reconciler for cleanup
Using index info to reconstruct a base tree...
M	hypershift-operator/controllers/nodepool/nodepool_controller.go
M	hypershift-operator/controllers/nodepool/nodepool_controller_test.go
M	hypershift-operator/controllers/scheduler/autoscaler_test.go
Falling back to patching base and 3-way merge...
Auto-merging hypershift-operator/controllers/scheduler/autoscaler_test.go
Auto-merging hypershift-operator/controllers/nodepool/nodepool_controller_test.go
CONFLICT (content): Merge conflict in hypershift-operator/controllers/nodepool/nodepool_controller_test.go
Auto-merging hypershift-operator/controllers/nodepool/nodepool_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 nodepool_controller: add a reconciler for cleanup
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/jira refresh
/cherry-pick release-4.16 release-4.15

enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 25, 2024
We currently need to keep old userdata secrets in AWS < 4.16 to prevent machineDeployments rollout from failing to delete old Machines.
https://github.com/openshift/hypershift/blob/3efa23b932a59681346a7a432d349cfb6e44b13d/hypershift-operator/controllers/nodepool/nodepool_controller.go#L775-L777
kubernetes-sigs/cluster-api-provider-aws#3805

That behaviour regressed in openshift#3969.

This PR fixes that by statically checking the HostedCluster (hc) release version.
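A static version gate of the kind this commit message describes might look like the sketch below. It is illustrative only: the function name, the `"AWS"` platform string, and the `major.minor` parsing are assumptions, not the code from the referenced commit.

```go
package main

import "fmt"

// shouldKeepOldUserData reports whether stale user data secrets must be kept:
// only on AWS and only for releases older than 4.16. Hypothetical helper.
func shouldKeepOldUserData(platform, release string) (bool, error) {
	var major, minor int
	// Sscanf reads only the leading "major.minor"; anything after is ignored.
	if _, err := fmt.Sscanf(release, "%d.%d", &major, &minor); err != nil {
		return false, fmt.Errorf("parsing release %q: %w", release, err)
	}
	older := major < 4 || (major == 4 && minor < 16)
	return platform == "AWS" && older, nil
}

func main() {
	for _, release := range []string{"4.15.3", "4.16.0"} {
		keep, err := shouldKeepOldUserData("AWS", release)
		fmt.Println(release, keep, err)
	}
}
```

A cleanup reconciler could consult such a gate before pruning, skipping user data secrets whenever it returns true.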
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.